Business Analytics

Writing Functions

Ayush Patel and Jayati Sharma

12 February, 2024

Pre-requisite

You already:

  • Know basic and advanced data wrangling functions in R
  • Know basics of data visualization in R
  • Know univariate and multivariate linear regression

Before we begin

Please install and load the following packages

library(dplyr)
library(tidyverse)



Access the lecture slides from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn how to write functions in R
  • Learn how to perform iterations

How can we do this better?

  • Assume that you are working on the following dataset
  • You want to multiply each numeric variable by a fixed number, 0.837636
  • One way to do this is as shown below
data <- data.frame(id = rep(letters[8:10], each = 2),
                   a = seq(5, 10), b = seq(1, 6),
                   c = seq(32, 37), d = seq(19, 24))

new_data <- data %>%
  mutate(new_a = 0.837636 * a, 
         new_b = 0.837636 * b,
         new_c = 0.837236 * c,
         new_d = 0.837636 * d)

How can we do this better?

  • However, there are two problems with this
  • This process gets tedious as the number of operations increases
  • Mistakes can also creep in when copying and pasting the number, as in the third line of the mutate() call above (0.837236 instead of 0.837636)
  • In such cases, it is better to write functions

Writing a basic function

  • For the previous problem, we can instead write a function to perform the same operation many times
  • The idea is to choose a name, call function() with an argument (here x), and put the operation to perform on that argument inside {}
  • You will soon learn how to write functions in detail
myfirstfunction <- function(x){0.837636 * x}

data %>%
  mutate(new_a = myfirstfunction(a), 
         new_b = myfirstfunction(b),
         new_c = myfirstfunction(c),
         new_d = myfirstfunction(d))
  id  a b  c  d    new_a    new_b    new_c    new_d
1  h  5 1 32 19 4.188180 0.837636 26.80435 15.91508
2  h  6 2 33 20 5.025816 1.675272 27.64199 16.75272
3  i  7 3 34 21 5.863452 2.512908 28.47962 17.59036
4  i  8 4 35 22 6.701088 3.350544 29.31726 18.42799
5  j  9 5 36 23 7.538724 4.188180 30.15490 19.26563
6  j 10 6 37 24 8.376360 5.025816 30.99253 20.10326
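
A possible next step, sketched here under assumptions: the multiplier can itself be an argument with a default value, so the number is written in exactly one place. The name scale_by and the argument factor are hypothetical and not part of the lecture code.

```r
# Hypothetical variant of myfirstfunction: the multiplier is an argument
# with a default value, so it only has to be typed once.
scale_by <- function(x, factor = 0.837636) {
  factor * x
}

scale_by(c(1, 2, 3))              # uses the default multiplier
scale_by(c(1, 2, 3), factor = 2)  # overrides it for a single call
```

If the multiplier changes later, only the default needs to be edited.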

Writing a basic function

  • We can further modify our code to make it cleaner
data %>%
  mutate(across(a:d, myfirstfunction))
  id        a        b        c        d
1  h 4.188180 0.837636 26.80435 15.91508
2  h 5.025816 1.675272 27.64199 16.75272
3  i 5.863452 2.512908 28.47962 17.59036
4  i 6.701088 3.350544 29.31726 18.42799
5  j 7.538724 4.188180 30.15490 19.26563
6  j 8.376360 5.025816 30.99253 20.10326
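
Note that across() overwrites a to d here. If the original columns should be kept, as in the first version, one option is the .names argument of across(). The following is a minimal sketch that rebuilds the example data so it can run on its own.

```r
library(dplyr)

data <- data.frame(id = rep(letters[8:10], each = 2),
                   a = seq(5, 10), b = seq(1, 6),
                   c = seq(32, 37), d = seq(19, 24))

myfirstfunction <- function(x) {
  0.837636 * x
}

# .names builds each new column name from the original one ({.col}),
# so a, b, c and d are kept and new_a, new_b, new_c and new_d are added.
new_data <- data %>%
  mutate(across(a:d, myfirstfunction, .names = "new_{.col}"))

names(new_data)
```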

Why write a function?

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • A function can be given a descriptive name, which makes the code easier to understand
  • Easier to update code
  • Reduces the chances of making mistakes while copying and pasting values
  • Overall makes the code clearer for future use
  • Good rule of thumb - Consider writing a function whenever you’ve copied and pasted a block of code more than twice

Using conditions in functions

  • Let us move to writing more complex functions
  • The logic remains the same; you simply add conditions
  • Using data, you want to multiply each even number by 2 and keep each odd number as it is
logic_multiply <- function(x) {ifelse(x %% 2 == 0, x * 2, x)}
  • This function can now be used to transform data
data %>%
  mutate(across(a:d, logic_multiply))
  id  a  b  c  d
1  h  5  1 64 19
2  h 12  4 33 40
3  i  7  3 68 21
4  i 16  8 35 44
5  j  9  5 72 23
6  j 20 12 37 48
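
Before applying a function like this to a whole data frame, it can help to try it on a small vector where the answer is easy to check by hand. A minimal sketch:

```r
logic_multiply <- function(x) {
  ifelse(x %% 2 == 0, x * 2, x)
}

# ifelse() checks the condition element by element,
# so the function accepts a whole vector at once.
logic_multiply(1:6)
# 1 4 3 8 5 12
```

The even values 2, 4 and 6 are doubled, while 1, 3 and 5 pass through unchanged.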

Do It Yourself -1

  • Using data, multiply all numeric variables by 1.77364 by writing a function
  • Write a function to find the square root of all numeric variables in data
  • Write a function that multiplies the number by 2 if it is greater than 10, else keeps the number as it is
  • Write a function that multiplies the number by 2 if it is greater than 20, else multiplies it by 3

Data Frame Functions

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Beyond vector functions, it is also possible to write data frame functions
  • Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector
  • Suppose you want to group data by id and compute the mean of a numeric variable

Data Frame Functions

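  • We can write this as a function whose arguments name the data frame, the grouping variable and the variable to average
  • Note: when an argument stands for a variable in a data frame, we need to embrace it using {{ }}
  • This is needed because of indirection: dplyr lets you refer to columns by their bare names, so inside a function it would otherwise look for a column literally called group_var
  • Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name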
grouped_mean <- function(dataframe, group_var, mean_var) {
  dataframe %>%
    group_by({{ group_var }}) %>%
    summarize(mean({{ mean_var }}))
}

grouped_mean(data, id, a)
# A tibble: 3 × 2
  id    `mean(a)`
  <chr>     <dbl>
1 h           5.5
2 i           7.5
3 j           9.5
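
To see what embracing changes, the sketch below first tries a version without {{ }} and catches the error it raises, then uses an embraced version that also gives the summary column a fixed name. The names grouped_mean_plain, grouped_mean_named and mean_value are hypothetical and chosen only for this illustration.

```r
library(dplyr)

data <- data.frame(id = rep(letters[8:10], each = 2),
                   a = seq(5, 10), b = seq(1, 6),
                   c = seq(32, 37), d = seq(19, 24))

# Without embracing, the arguments are not translated into column names.
grouped_mean_plain <- function(dataframe, group_var, mean_var) {
  dataframe %>%
    group_by(group_var) %>%
    summarize(mean(mean_var))
}

# tryCatch() turns the resulting error into a short message.
attempt <- tryCatch(
  grouped_mean_plain(data, id, a),
  error = function(e) "the version without {{ }} failed"
)
attempt

# With embracing, and with a named summary column (mean_value).
grouped_mean_named <- function(dataframe, group_var, mean_var) {
  dataframe %>%
    group_by({{ group_var }}) %>%
    summarize(mean_value = mean({{ mean_var }}, na.rm = TRUE))
}

result <- grouped_mean_named(data, id, a)
result
```

Naming the column inside summarize() avoids the backticked `mean(a)` header seen in the output above.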

Data Frame Functions

  • A common use for such functions is when you do your exploratory data analysis
  • You might want to see a certain set of descriptive statistics
  • Writing a function would be useful in such a case
my_eda_function <- function(dataframe, variable) {
  dataframe %>%
    summarize(
      count = n(),
      minimum_value = min({{ variable }}, na.rm = TRUE),
      maximum_value = max({{ variable }}, na.rm = TRUE),
      range = max({{ variable }}, na.rm = TRUE) - min({{ variable }}, na.rm = TRUE)
    )
}

my_eda_function(data,a)
  count minimum_value maximum_value range
1     6             5            10     5
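
Because the function takes a data frame as its first argument and uses summarize(), it should also work on grouped data. A minimal sketch, repeating the data and function so that it runs on its own:

```r
library(dplyr)

data <- data.frame(id = rep(letters[8:10], each = 2),
                   a = seq(5, 10), b = seq(1, 6),
                   c = seq(32, 37), d = seq(19, 24))

my_eda_function <- function(dataframe, variable) {
  dataframe %>%
    summarize(
      count = n(),
      minimum_value = min({{ variable }}, na.rm = TRUE),
      maximum_value = max({{ variable }}, na.rm = TRUE),
      range = max({{ variable }}, na.rm = TRUE) - min({{ variable }}, na.rm = TRUE)
    )
}

# summarize() respects the grouping, so the result has one row per id.
by_id <- data %>%
  group_by(id) %>%
  my_eda_function(a)

by_id
```

Each id has two rows in data, so every group reports a count of 2 and a range of 1.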

Do It Yourself -2

  • Using the ChickWeight data from datasets, create your own summary function that gives the minimum, maximum and range of the weight variable
  • Write a function that groups by Diet and gives the mean weight

Plot Functions

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Instead of returning a data frame, a function can also return a plot
  • Use the diamonds dataset from ggplot2
  • Suppose you want to create many histograms
  • Repeating the same code with only the variable or binwidth changed can be avoided

Plot Functions - Example 1

  • Instead, a function can be written so that the plotting code does not have to be repeated every time
diamond_histogram_function <- function(data, variable, binwidth){
  data %>%
    ggplot(aes(x = {{ variable }})) +
    geom_histogram(binwidth = binwidth)
}
diamond_histogram_function(diamonds, carat, 0.1)
  • Can you guess why we did not embrace binwidth while writing the function?
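
A plot function can also label its own output. The sketch below follows the approach described in 'R for Data Science (2e)' and uses rlang::englue() to build a title from the arguments. The function name histogram_with_title is hypothetical, and the sketch assumes the rlang package is installed (it is installed together with the tidyverse).

```r
library(ggplot2)

histogram_with_title <- function(data, variable, binwidth) {
  # englue() inserts the variable name ({{ }}) and the binwidth value ({ }).
  label <- rlang::englue("A histogram of {{variable}} with binwidth {binwidth}")

  ggplot(data, aes(x = {{ variable }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}

p <- histogram_with_title(diamonds, carat, 0.1)
p  # printing the object draws the plot
```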

Plot Functions with Data Wrangling

  • Plot functions can be combined with data wrangling functions as well
  • For example, the function below draws a horizontal bar graph with the bars sorted by frequency
diamonds_sorted_bars <- function(dataframe, variable) {
  dataframe %>% 
    mutate({{ variable }} := fct_rev(fct_infreq({{ variable }}))) %>%
    ggplot(aes(y = {{ variable }})) +
    geom_bar() +
    theme_minimal()
}


diamonds %>%
  diamonds_sorted_bars(clarity)
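
The wrangling step can be checked on its own. In the function, := is used instead of = because the name of the column being overwritten comes from the embraced argument. The sketch below uses a small made-up factor (not the diamonds data) to show what fct_infreq() and fct_rev() do to the level order.

```r
library(forcats)

# Made-up factor: "a" appears three times, "c" twice, "b" once.
x <- factor(c("a", "a", "a", "b", "c", "c"))

levels(fct_infreq(x))           # most frequent level first: "a" "c" "b"
levels(fct_rev(fct_infreq(x)))  # reversed: "b" "c" "a"
```

ggplot2 draws the first level at the bottom of a discrete y axis, so after reversing, the most frequent category ends up at the top of the chart.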

Do It Yourself -3

  • Using the diamonds dataset, write a function that makes a bar graph, then use it to plot cut and color
  • Write a function that plots a histogram with a chosen binwidth, then use it to plot the variables x and y

Thank You :)